Getting Started with Preparing Private Data for RAG
AI010, Lesson 7

The Foundations of RAG

Standard Large Language Models (LLMs) are "frozen" in time, limited by their training-data cut-off. They cannot answer questions about your company's internal handbook or a private video meeting from yesterday. Retrieval-Augmented Generation (RAG) bridges this gap by pulling relevant context from your own private data.

The Multi-Stage Workflow

To make private data digestible for a language model, we follow a specific pipeline:

  • Load: convert sources in various formats (PDFs, web pages, YouTube) into a standard document format.
  • Split: break long documents into smaller, manageable "chunks."
  • Embed: convert each text chunk into a numeric vector (a mathematical representation of its meaning).
  • Store: keep these vectors in a vector database (such as Chroma) for extremely fast similarity search.
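The four stages above can be sketched end to end. The snippet below is a toy, pure-Python stand-in, not a production pipeline: real projects use an embedding model and a vector database such as Chroma, while the `load`, `split`, `embed`, and `search` helpers here are illustrative names and the bag-of-words "embedding" only mimics real similarity search.

```python
# Toy RAG preparation pipeline: Load -> Split -> Embed -> Store -> Search.
# Every stage is a tiny pure-Python stand-in for the real components.
import math
from collections import Counter

def load(raw_text: str) -> str:
    """Load: in practice this stage would parse a PDF, web page, or transcript."""
    return raw_text  # stand-in: the raw text itself

def split(text: str, chunk_size: int = 40) -> list[str]:
    """Split: cut the document into manageable fixed-size chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunk: str) -> Counter:
    """Embed: a bag-of-words count standing in for a real embedding model."""
    return Counter(chunk.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count 'vectors'."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Store: a plain list plays the role of the vector database.
doc = load("RAG retrieves relevant context. Chunking respects the context window.")
store = [(chunk, embed(chunk)) for chunk in split(doc)]

def search(question: str) -> str:
    """Retrieve: return the stored chunk most similar to the question."""
    q = embed(question)
    return max(store, key=lambda item: cosine(q, item[1]))[0]

print(search("What does RAG retrieve?"))
```

Swapping each stand-in for its real counterpart (a document loader, a text splitter, an embedding model, a vector store) gives the actual pipeline without changing the overall shape.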
Why Is Chunking Critical?
Language models have a "context window," an upper limit on how much text they can process at once. Send a hundred-page PDF and the model simply cannot handle it. We chunk the data so that only the most relevant pieces are passed to the model.
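A minimal splitter makes this concrete. The `chunk_size` and `chunk_overlap` parameters mirror those of common text-splitter APIs, but this `split_text` helper is a hypothetical pure-Python sketch, not a library function:

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share
    chunk_overlap characters, so a thought cut off at one chunk's
    boundary still appears whole at the start of the next chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

chunks = split_text("The context window limits how much text fits at once.",
                    chunk_size=20, chunk_overlap=5)
for c in chunks:
    print(repr(c))
```

Note that each chunk's last five characters reappear as the next chunk's first five: that shared slice is exactly what `chunk_overlap` buys you.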
Question 1
Why is chunk_overlap considered a critical parameter when splitting documents for RAG?
  • To reduce the total number of tokens used by the LLM.
  • To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.
  • To make the vector database store data faster.
Challenge: Preserving Context
Apply your knowledge to a real-world scenario.
You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."
Task
Which splitter would be best for keeping context like "Section Headers" intact?
Solution:
MarkdownHeaderTextSplitter or RecursiveCharacterTextSplitter. These allow you to maintain document structure in the metadata, helping the retrieval system distinguish between different chapters or lectures.
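The idea behind header-aware splitting can be sketched without the library. This `split_by_headers` helper is a hypothetical stand-in for MarkdownHeaderTextSplitter: it groups lines under their most recent `#`-style header and keeps that header as chunk metadata, which is what lets retrieval tell "Lecture 1" apart from "Lecture 2."

```python
def split_by_headers(markdown: str) -> list[dict]:
    """Group lines under their most recent '#'-style header, keeping
    that header as metadata on each resulting chunk."""
    chunks: list[dict] = []
    header: str | None = None
    lines: list[str] = []

    def flush() -> None:
        # Emit the accumulated lines as one chunk tagged with its header.
        if lines:
            chunks.append({"metadata": {"header": header},
                           "content": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()  # close out the previous section
            header, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()  # don't forget the final section
    return chunks

doc = """# Lecture 1
Vectors encode meaning.
# Lecture 2
Databases store vectors."""
for chunk in split_by_headers(doc):
    print(chunk["metadata"]["header"], "->", chunk["content"])
```

Because the header travels with each chunk as metadata rather than as plain text, the retrieval step can filter or rank by section before comparing vectors at all.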